Abstract. We consider the setting of ontological database access, where an A-box
is given in form of a relational database D and where a Boolean conjunctive
query q has to be evaluated against D modulo a T-box Σ formulated in DL-Lite
or Linear Datalog±. It is well known that (Σ, q) can be rewritten into an
equivalent nonrecursive Datalog program P that can be directly evaluated over
D. However, for Linear Datalog± or for DL-Lite versions that allow for role
inclusion, the rewriting methods described so far result in a nonrecursive Datalog
program P of size exponential in the joint size of Σ and q. This gives rise to
the interesting question of whether such a rewriting necessarily needs to be of
exponential size. In this paper we show that it is actually possible to translate
(Σ, q) into a polynomially sized equivalent nonrecursive Datalog program P.



1 Introduction 



1.1 Motivation 

This paper is about query rewriting in the context of ontological database access. Query 
rewriting is an important new optimization technique specific to ontological queries. 
The essence of query rewriting, as will be explained in more detail below, is to com- 
pile a query and an ontological theory (usually formulated in some description logic or 
rule-based language) into a target query language that can be directly executed over a 
relational database management system (DBMS). The advantage of such an approach 
is obvious. Query rewriting can be used as a preprocessing step for enabling the ex- 
ploitation of mature and efficient existing database technology to answer ontological 
queries. In particular, after translating an ontological query into SQL, sophisticated 
query-optimization strategies can be used to efficiently answer it. However, there is 
a pitfall here. If the translation inflates the query excessively and creates from a rea- 
sonably sized ontological query an enormous exponentially sized SQL query (or SQL 
DDL program), then the best DBMS may be of little use. 



* This paper extends a shorter version presented at DL 2011, 24th International Workshop on
Description Logics, mainly by expanding Section 3. Further extended and/or improved versions
will be posted as they arise to arXiv CoRR at http://arxiv.org/abs/1105.3757



1.2 Main Results 



We show that polynomially sized query rewritings into nonrecursive Datalog exist in 
specific settings. Note that nonrecursive Datalog can be efficiently translated into SQL 
with view definitions (SQL DDL), which, in turn, can be directly executed over any 
standard DBMS. Our results are, for the time being, of a theoretical nature and
we do not claim that they will lead to better practical algorithms. This will be studied
via implementations in the near future. Our main result applies to the setting where
ontological constraints are formulated in terms of tuple-generating dependencies (tgds),
and we make heavy use of the well-known chase procedure [17, 14]. For definitions,
see Section 2. The result of chasing a database D with a tgd set Σ is denoted by
chase(D, Σ).

Consider a set Σ of tgds and a database D over a joint signature ℛ. Let q be a
Boolean conjunctive query (BCQ) issued against (D, Σ). We would like to transform q
into a nonrecursive Datalog query P such that (D, Σ) ⊨ q iff D ⊨ P. We assume here
that P has a special propositional goal atom goal, and D ⊨ P means that goal is derivable
from P when evaluated over D. Let us define an important property of classes of tgds.

Definition 1. Polynomial witness property (PWP). The PWP holds for a class C of
tgds if there exists a polynomial γ such that, for every finite set Σ ⊆ C of tgds and each
BCQ q, the following holds: for each database D, whenever (D, Σ) ⊨ q, then there is
a sequence of at most γ(|Σ|, |q|) chase steps whose atoms already entail q.

Our main technical result, which is more formally stated and proven in Section 3, 
is as follows. 

Theorem 1. Let Σ be a set of tgds from a class C enjoying the PWP. Then each BCQ
q can be rewritten in polynomial time into a nonrecursive Datalog program P of size
polynomial in the joint size of q and Σ, such that for every database D, (D, Σ) ⊨ q if
and only if D ⊨ P. Moreover, the arity of P is max(a + 1, 3), where a is the maximum
arity of any predicate symbol occurring in Σ, in case a sufficiently large linear order
can be accessed in the database, and O(max(a + 1, 3) · log m) otherwise, where m
is the joint size of q and Σ.

1.3 Other Results

From this result, and from already established facts, a good number of further rewritability
results for other formalisms can be derived. In particular, we can show that conjunctive
queries based on other classes of tgds or description logics can be efficiently translated
into nonrecursive Datalog. Among these formalisms are: linear tgds, originally
defined in [5] and equivalent to inclusion dependencies, various major versions of the 
well-known description logic DL-Lite [9, 20], and sticky tgds [8] as well as sticky-join 
tgds [6, 7]. We will just give an overview and very short explanations of how each of 
these rewritability results follows from our main theorem. A more detailed treatment is 
planned for a future version of this paper. 



1.4 Structure of the Paper 

The rest of the paper is structured as follows. In Section 2 we state a few preliminaries 
and simplifying assumptions. In Section 3, we give a rather detailed proof sketch of 
the main result. Section 4 contains the other results following from the main result. A
brief overview of related work concludes the paper in Section 5. 

2 Preliminaries and Assumptions 

We assume the reader to be familiar with the terminology of relational databases and 
the concepts of conjunctive query (CQ) and Boolean conjunctive query (BCQ). For 
simplicity, we restrict our attention to Boolean conjunctive queries q. However, our 
results can easily be reformulated for queries with output, see Remark 3 after the proof 
of Theorem 1 . 

Given a relational schema ℛ, a tuple-generating dependency (tgd) σ is a first-order
formula of the form ∀X ∀Y Φ(X, Y) → ∃Z Ψ(X, Z), where Φ(X, Y) and Ψ(X, Z)
are conjunctions of atoms over ℛ, called the body and the head of σ, denoted body(σ)
and head(σ), respectively. We usually omit the universal quantifiers in tgds. Such σ is
satisfied in a database D for ℛ iff, whenever there exists a homomorphism h that maps
the atoms of Φ(X, Y) to atoms of D, there exists an extension h' of h that maps the
atoms of Ψ(X, Z) to atoms of D. All sets of tgds are finite here. We assume in the rest
of the paper that every tgd has exactly one atom and at most one existentially quantified
variable in its head. A set of tgds is in normal form if the head of each tgd consists
of a single atom. It was shown in [4, Lemma 10] that every set Σ of tgds can be
transformed into a set Σ' in normal form of size at most quadratic in |Σ|, such that Σ
and Σ' are equivalent with respect to query answering. The normal form transformation
shown in [4] can be achieved in logarithmic space. It is, moreover, easy to see that this
very simple transformation preserves the polynomial witness property.

For a database D for ℛ, and a set of tgds Σ on ℛ, the set of models of D and Σ,
denoted mods(D, Σ), is the set of all (possibly infinite) databases B such that (i) D ⊆ B
and (ii) every σ ∈ Σ is satisfied in B. The set of answers for a CQ q to D and Σ, denoted
ans(q, D, Σ), is the set of all tuples a such that a ∈ q(B) for all B ∈ mods(D, Σ). The
answer for a BCQ q to D and Σ is yes iff the empty tuple is in ans(q, D, Σ), also
denoted as D ∪ Σ ⊨ q.

Note that, in general, query answering under tgds is undecidable [2], even when the 
schema and tgds are fixed [4]. Query answering is, however, decidable for interesting 
classes of tgds, among which are those considered in the present paper. 

The chase procedure was introduced to enable checking implication of dependen- 
cies [17], and later also for checking query containment [14]. It is a procedure for repair- 
ing a database relative to a set of dependencies, so that the result of the chase satisfies 
the dependencies. By "chase", we refer both to the chase procedure and to its output. 
The chase comes in two flavors: restricted and oblivious, where the restricted chase
applies tgds only when they are not satisfied (to repair them), while the oblivious chase
always applies tgds (if they produce a new result). We focus on the oblivious one, since
it makes proofs technically simpler. The (oblivious) tgd chase rule defined below is the
building block of the chase.



TGD Chase Rule. Consider a database D for a relational schema ℛ, and a tgd σ
on ℛ of the form Φ(X, Y) → ∃Z Ψ(X, Z). Then, σ is applicable to D if there
exists a homomorphism h that maps the atoms of Φ(X, Y) to atoms of D. Let σ be
applicable to D, and h_1 be a homomorphism that extends h as follows: for each X_i ∈
X, h_1(X_i) = h(X_i); for each Z_j ∈ Z, h_1(Z_j) = z_j, where z_j is a fresh null value
(i.e., a Skolem constant) different from all nulls already introduced. The application of
σ on D adds to D the atom h_1(Ψ(X, Z)) if not already in D (which is possible when
Z is empty). ■
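
The chase rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: atoms are encoded as pairs (predicate, argument tuple), all tgd arguments are variables, and the names homomorphisms, apply_tgd and the null prefix _null are hypothetical.

```python
from itertools import count

def homomorphisms(body, facts, h=None):
    """Yield each assignment of the body variables that maps every body atom into facts."""
    h = h or {}
    if not body:
        yield dict(h)
        return
    (pred, args), rest = body[0], body[1:]
    for fpred, fargs in sorted(facts):
        if fpred != pred or len(fargs) != len(args):
            continue
        g = dict(h)
        # Bind each variable to the fact's value; reject on a conflicting binding.
        if all(g.setdefault(v, c) == c for v, c in zip(args, fargs)):
            yield from homomorphisms(rest, facts, g)

def apply_tgd(body, head, facts, fresh):
    """Oblivious chase rule: add one head atom per homomorphism of the body."""
    result = set(facts)
    for h in homomorphisms(body, facts):
        h1 = dict(h)
        for var in head[1]:
            if var not in h1:                 # existentially quantified variable
                h1[var] = f"_null{next(fresh)}"
        result.add((head[0], tuple(h1[var] for var in head[1])))
    return result

# A tgd of the form R1(X, Y) -> exists Z. R4(X, Y, Z), applied to one fact.
facts = {("R1", ("a", "b"))}
chased = apply_tgd([("R1", ("X", "Y"))], ("R4", ("X", "Y", "Z")), facts, count(1))
```

Here the single homomorphism X ↦ a, Y ↦ b is extended by a fresh null for Z, so one new R4-atom is added.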

The chase algorithm for a database D and a set of tgds Σ consists of an exhaustive
application of the tgd chase rule in a breadth-first (level-saturating) fashion, which leads
as result to a (possibly infinite) chase for D and Σ. Formally, the chase of level up
to 0 of D relative to Σ, denoted chase^0(D, Σ), is defined as D, assigning to every
atom in D the (derivation) level 0. For every k ≥ 1, the chase of level up to k of D
relative to Σ, denoted chase^k(D, Σ), is constructed as follows: let I_1, ..., I_n be all
possible images of bodies of tgds in Σ relative to some homomorphism such that (i)
I_1, ..., I_n ⊆ chase^{k-1}(D, Σ) and (ii) the highest level of an atom in some I_i is k − 1;
then, perform every corresponding tgd application on chase^{k-1}(D, Σ), choosing the
applied tgds and homomorphisms in a linear and lexicographic order, respectively, and
assigning to every new atom the (derivation) level k. The chase of D relative to Σ,
denoted chase(D, Σ), is thus the limit of chase^k(D, Σ) for k → ∞.
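
The level-saturating construction can be illustrated by the following Python sketch. It is a simplified reading of the definition, not the paper's algorithm verbatim: each pair (tgd, homomorphism) is applied exactly once, which has the same effect as condition (ii), and the names chase_levels, homomorphisms and the null prefix _null are assumptions made for this example.

```python
from itertools import count

def homomorphisms(body, facts, h=None):
    """Yield each assignment of the body variables that maps every body atom into facts."""
    h = h or {}
    if not body:
        yield dict(h)
        return
    (pred, args), rest = body[0], body[1:]
    for fpred, fargs in sorted(facts):
        if fpred != pred or len(fargs) != len(args):
            continue
        g = dict(h)
        if all(g.setdefault(v, c) == c for v, c in zip(args, fargs)):
            yield from homomorphisms(rest, facts, g)

def chase_levels(db, tgds, max_level):
    """Return {atom: derivation level} for the oblivious chase up to max_level."""
    level = {atom: 0 for atom in db}
    applied, fresh = set(), count(1)
    for k in range(1, max_level + 1):
        snapshot = list(level)                # atoms of the chase of level up to k-1
        derived = []
        for idx, (body, head) in enumerate(tgds):
            for h in homomorphisms(body, snapshot):
                trigger = (idx, tuple(sorted(h.items())))
                if trigger in applied:        # body image already used at a lower level
                    continue
                applied.add(trigger)
                for var in head[1]:
                    if var not in h:          # existential variable gets a fresh null
                        h[var] = f"_null{next(fresh)}"
                derived.append((head[0], tuple(h[var] for var in head[1])))
        for atom in derived:
            level.setdefault(atom, k)
    return level

# Linear tgd R(X, Y) -> exists Z. R(Y, Z): the full chase is infinite, so we stop at level 3.
tgds = [([("R", ("X", "Y"))], ("R", ("Y", "Z")))]
levels = chase_levels({("R", ("a", "b"))}, tgds, 3)
```

The example shows why the chase is "possibly infinite": every level introduces a new null, so only a bounded prefix can be materialized.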

The (possibly infinite) chase relative to tgds is a universal model, i.e., there exists
a homomorphism from chase(D, Σ) onto every B ∈ mods(D, Σ) [11, 4]. This result
implies that BCQs q over D and Σ can be evaluated on the chase for D and Σ, i.e.,
D ∪ Σ ⊨ q is equivalent to chase(D, Σ) ⊨ q.

A chase sequence of length n based on D and Σ is a sequence of n atoms such that
each atom is either from D or can be derived via a single application of some rule in Σ
from previous atoms in the sequence. If S is such a chase sequence and q a conjunctive
query, we write S ⊨ q if there is a homomorphism from q to the set of atoms of S.
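
Checking S ⊨ q amounts to a search for a homomorphism, which the following Python sketch performs by backtracking. The atom encoding, the function name entails, and the concrete six-atom sequence (including the nulls n2 and n4) are assumptions chosen for illustration; they are not taken from the paper's figures.

```python
def entails(atoms, query, h=None):
    """Return True if some homomorphism maps every query atom into atoms."""
    h = h or {}
    if not query:
        return True
    (pred, args), rest = query[0], query[1:]
    for fpred, fargs in atoms:
        if fpred != pred or len(fargs) != len(args):
            continue
        g = dict(h)
        # Extend the partial homomorphism; a clash on a shared variable rejects the atom.
        if all(g.setdefault(v, c) == c for v, c in zip(args, fargs)):
            if entails(atoms, rest, g):
                return True
    return False

# Hypothetical chase sequence and the query R5(X, Y), R5(Y, X).
sequence = [
    ("R1", ("a", "b")), ("R4", ("a", "b", "n2")),
    ("R2", ("c", "g")), ("R4", ("n4", "c", "g")),
    ("R5", ("a", "g")), ("R5", ("g", "a")),
]
query = [("R5", ("X", "Y")), ("R5", ("Y", "X"))]
satisfied = entails(sequence, query)
```

With the last atom removed, the query is no longer entailed, since no R5-atom can serve as the image of R5(Y, X).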

We assume that every database has two constants, 0 and 1, that are available via
the unary predicates Zero and One, respectively. Moreover, each database has a binary 
predicate Neq such that Neq(a, b) is true precisely if a and b are distinct values. 

We finally define N-numerical databases. Let D be a database whose domain does
not contain any natural numbers. We define D_N as the extension of D by adding the
natural numbers 0, 1, ..., N to its domain, a unary relation Num that contains exactly
the numbers 1, ..., N, and binary order relations Succ and < on 0, 1, ..., N, expressing
the natural successor and "<" orders on the natural numbers, respectively.³ We refer to
D_N as the N-numerical extension of D, and to a so extended database as an N-numerical
database. We denote the total domain of a numerical database D_N by dom_N(D) and the
non-numerical domain (still) by dom(D). Standard databases can always be considered to
be N-numerical, for some large N, by the standard type integer, with the < predicate
(and even arithmetic operations). A number maxint corresponding to N can be defined.



³ Of course, if dom(D) already contains some natural numbers we can add a fresh copy of
{0, 1, ..., N} instead.
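
As a small illustration of the N-numerical extension, the Python sketch below adds the relations Num, Succ and < to a set of facts. The fact encoding, the function name numerical_extension and the relation name Lt (standing for "<") are assumptions made for this example.

```python
def numerical_extension(db, n):
    """Return db extended by Num, Succ and Lt facts over the numbers 0..n."""
    # The definition assumes that the original domain contains no natural numbers.
    assert not any(isinstance(c, int) for _, args in db for c in args)
    ext = set(db)
    ext |= {("Num", (i,)) for i in range(1, n + 1)}          # Num holds for 1..n
    ext |= {("Succ", (i, i + 1)) for i in range(n)}          # successor on 0..n
    ext |= {("Lt", (i, j)) for i in range(n + 1) for j in range(i + 1, n + 1)}
    return ext

d3 = numerical_extension({("R1", ("a", "b"))}, 3)
```

For N = 3 this yields three Num facts, three Succ facts and six Lt facts in addition to the original database.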



3 Main Result 



Our main result is more formally stated as follows: 

Theorem 1. Let C be a class of tgds in normal form, enjoying the polynomial witness
property, and let γ be the polynomial bounding the number of chase steps (with
γ(n_1, n_2) > max(n_1, n_2), for all naturals n_1, n_2). For each set Σ ⊆ C of tgds and
each Boolean CQ q, one can compute in polynomial time a nonrecursive Datalog program
P of polynomial size in |Σ| and |q|, such that, for every database D, it holds that
(D, Σ) ⊨ q if and only if D ⊨ P. Furthermore:

(a) For N-numerical databases D, where N ≥ γ(|Σ|, |q|), the arity of P is max(a +
1, 3), where a is the maximum arity of any predicate symbol occurring in Σ;

(b) otherwise (for non-numerical databases), the arity of P is O(max(a + 1, 3) ·
log γ(|Σ|, |q|)), where a is as above.

We note that N is polynomially bounded in |Σ| and |q| by the polynomial γ that
only depends on C.

The rest of this section is dedicated to a detailed proof of Theorem 1 . 

Proof. We first describe the construction of a Datalog program P of arity a + k + 4,
where k is the maximum number of tuples in any left hand side of a chase rule. We
explain afterwards how the arity can be reduced to max(a + 1, 3). The program P
checks whether there is a chase sequence S = t_1, ..., t_N with respect to D and Σ
and a homomorphism h from q to (the set of atoms of) S. To this end, P consists of
one large rule r_goal of polynomial size in N and some shorter rules that define auxiliary
relations and will be explained below.

The aim of r_goal is to guess the chase sequence S and the homomorphism h at the
same time. We recall that N does not depend on the size of D but only on |Σ| and
|q| and thus r_goal can well be as long as the chase sequence and q together. One of the
advantages of this approach is that we only have to deal with those null values that
are actually relevant for answering the query. Thus, at most N null values need to be
represented.

One might try to obtain r_goal by just taking one atom A_i for each tuple t_i of S and
one atom for each atom of q and somehow test that they are consistent. However, it is
not clear how consistency could possibly be checked in a purely conjunctive fashion.⁴
There are two ways in which disjunctive reasoning is needed. First, it is not a priori
clear on which previous tuples a tuple t_i will depend. Second, it is not a priori clear to
which tuples of S the atoms of q can be mapped.

To overcome these challenges we use the following basic ideas.

(1) We represent the tuples of S (and the required tuples of D) in a symbolic fashion,
utilizing the numerical domain.

(2) We let P compute auxiliary predicates that allow us to express disjunctive relationships
between the tuples in S.

⁴ Furthermore, of course, there are no relations to which the atoms A_i could possibly be
matched.



Example 1. We illustrate the proof idea with a very simple running example, shown in 
Figure 1. 



(a) Σ:

σ_1: R_1(X, Y) → ∃Z R_4(X, Y, Z)
σ_2: R_2(Y, Z) → ∃X R_4(X, Y, Z)
σ_3: R_3(X, Z) → ∃Y R_4(X, Y, Z)
σ_4: R_4(X_1, Y_1, Z_1), R_4(X_2, Y_2, Z_2) → R_5(X_1, Z_2)

(b) q: R_5(X, Y), R_5(Y, X)

(c) D: [contents not preserved in this copy]

Fig. 1. Simple example with (a) a set Σ of tgds, (b) a query q and (c) a database D.



A possible chase sequence in this example is shown in Figure 2(a). The mapping X ↦ a
and Y ↦ g maps R_5(X, Y) to t_5 and R_5(Y, X) to t_6, thus satisfying q. Before we
describe the proof idea in more detail, we fix some notation and convenient conventions.

Fig. 2. (a) Example chase sequence, (b) its extension and (c) its encoding. t_2 is obtained by
applying σ_1 to t_1. Likewise t_4 and t_5 are obtained by applying σ_2 to t_3 and σ_4 to t_2 and t_4,
respectively.

Notation and conventions. Let C be a class of tgds enjoying the PWP, let Σ be a set
of tgds from C, and let q be a BCQ. Let R_1, ..., R_m be the predicate symbols occurring
in Σ or in q. We denote the number of tgds in Σ by ℓ.

Let N := γ(|Σ|, |q|) where γ is as in Definition 1, thus N is polynomial in |Σ| and |q|.

By definition of N, if (D, Σ) ⊨ q, then q can be witnessed by a chase sequence
Γ of length ≤ N. Our assumption that γ(n_1, n_2) > max(n_1, n_2), for every n_1, n_2,
guarantees that N is larger than (i) the number of predicate symbols occurring in Σ, (ii)
the cardinality |q| of the query, and (iii) the number of rules in Σ.

For the sake of a simpler presentation, we assume that all relations in Σ have the
same arity a and all rules use the same number k of tuples in their body. The latter
can be easily achieved by repeating tuples, the former by filling up shorter tuples by
repeating the first tuple entry. Furthermore, we only consider chase sequences of length
N. Shorter sequences can be extended by adding tuples from D.

Example 2. Example 1 thus translates as illustrated in Figure 3. The (extended) chase
sequence is shown in Figure 2(b). The query q is now satisfied by the mapping X ↦ a,
Y ↦ g, U ↦ g, V ↦ a, thus mapping R_5(X, Y, X) to t_5 and R_5(Y, X, Y) to t_6.



(a) Σ:

σ_1: R_1(X, Y, X), R_1(X, Y, X) → ∃Z R_4(X, Y, Z)
σ_2: R_2(Y, Z, Y), R_2(Y, Z, Y) → ∃X R_4(X, Y, Z)
σ_3: R_3(X, Z, X), R_3(X, Z, X) → ∃Y R_4(X, Y, Z)
σ_4: R_4(X_1, Y_1, Z_1), R_4(X_2, Y_2, Z_2) → R_5(X_1, Z_2, X_1)

(b) q: R_5(X, Y, U), R_5(Y, X, V)

(c) D: [contents not preserved in this copy]

Fig. 3. Modified example with (a) a set Σ of tgds, (b) a query q and (c) a database D.

Proof (continued). On an abstract level, the atoms that make up the final rule r_goal
of P can be divided into three groups serving three different purposes. That is, r_goal can
be considered as a conjunction r_tuples ∧ r_chase ∧ r_query. Each group is "supported" by a
sub-program of P that defines relations that are used in r_goal, and we refer to these three
subprograms as P_tuples, P_chase and P_query, respectively.

- The purpose of r_tuples is basically to lay the ground for the other two. It consists of
N atoms that allow guessing the symbolic encoding of a sequence S = t_1, ..., t_N.

- The atoms of r_chase are designed to verify that S is an actual chase sequence with
respect to D.

- Finally, r_query checks that there is a homomorphism from q to S.

P_tuples and r_tuples. We continue with an explanation of the symbolic representation
of tuples underlying r_tuples.

The symbolic representation of the tuples t_i of the chase sequence S uses numerical
values to encode null values, predicate symbols R_i (by i), tgds σ_j ∈ Σ (by j) and the
number of a tuple t_i in the sequence (that is: i).

In particular, the symbolic encoding uses the following numerical parameters.⁵

- r_i to indicate the relation R_{r_i} to which the tuple belongs;

- f_i to indicate whether t_i is from D (f_i = 0) or yielded by the chase (f_i = 1);

- Furthermore, x_{i1}, ..., x_{ia} represent the attribute values of t_i as follows. If the j-th
attribute of t_i is a value from dom(D) then x_{ij} is intended to be that value,
otherwise it is a null represented by a numeric value.

Since each rule of Σ has at most one existential quantifier in its head, at each chase step,
at most one new null value can be introduced. Thus, we can unambiguously represent
the null value (possibly) introduced in the j-th step of the chase by the number j. In
particular, all null values introduced in a chase sequence (of length N) can indeed be
represented by elements of the numerical domain.

The remaining parameters s_i and c_{i1}, ..., c_{ik} are used to encode information about
the tgd and the tuples (atoms) in S that are used to generate the current tuple. More
precisely,

- s_i is intended to be the number of the applied tgd σ_{s_i} and

- c_{i1}, ..., c_{ik} are the tuple numbers of the k tuples that are used to yield t_i.

⁵ We use the names of the parameters as variable names in r_goal as well.



In the example, e.g., t_5 is obtained by applying σ_4 to t_2 and t_4. The encoding of our
running example can be found in Figure 2(c).

We use a new relational symbol T of arity a + k + 4 not present in the schema of
D for the representation of the tuples from S. Thus, r_tuples is just:

T(1, r_1, f_1, x_{11}, ..., x_{1a}, s_1, c_{11}, ..., c_{1k}), ...,
T(N, r_N, f_N, x_{N1}, ..., x_{Na}, s_N, c_{N1}, ..., c_{Nk}).

The sub-program P_tuples is intended to "fill" T with suitable tuples. The intention
is that T contains all tuples that could be used in a chase sequence in principle. At this
point, there are no restrictions regarding the chase rules. To this end, P_tuples uses two
kinds of rules, one for tuples from D and one for tuples yielded by the chase. For each
relation symbol R_j of D, P_tuples has a rule

T(Z, j, 0, X_1, ..., X_a, 0, 0, ..., 0) :- R_j(X_1, ..., X_a), Num(Z).

which adds all tuples from R_j to T and makes them accessible for every possible
position (Z) in S.

The following rule adds tuples that can possibly be obtained by chase steps.

T(Z, Y, 1, X_1, ..., X_a, V, U_1, ..., U_k) :-
    Num(Z), Num(Y), DNum(X_1), ..., DNum(X_a),
    Num(V), Num(U_1), ..., Num(U_k),
    1 ≤ Y ≤ m, 1 ≤ V ≤ ℓ, U_1 < Z, ..., U_k < Z.    (1)

Here, the first two inequalities make sure that only allowed relation and tgd numbers
are used, the latter inequalities guarantee that to yield a tuple by a chase rule only tuples
with smaller numbers can be used.⁶ The rule uses one further predicate DNum that has
not yet been defined. Its purpose is to contain all possible values, that is: dom(D) ∪ Num.
It is (easily) defined by further rules of P_tuples. Note that this leaves the values for the
X_j unconstrained, hence they can carry either domain values or numerical values.

P_chase and r_chase. Next, we describe the part of r_goal that checks that S constitutes
an actual chase sequence and the rules of P that specify the corresponding auxiliary
relations.

The following kinds of conditions have to be checked to ensure that the tuples
"guessed" by r_tuples constitute a chase sequence.

(1) For every i, the relation R_{r_i} of a tuple t_i has to match the head of its rule σ_{s_i}.

- In the example, e.g., r_4 has to be 4 as the head of σ_2 is an R_4-atom.

(2) Likewise, for each i and j the relation number of tuple t_{c_{ij}} has to be the relation
number of the j-th atom of σ_{s_i}.

- In the example, e.g., r_2 must be 4, as c_{5,1} = 2 and the first atom of σ_{s_5} = σ_4 is
an R_4-atom.

(3) If the head of σ_{s_i} contains an existentially quantified variable, the new null value is
represented by the numerical value i.

- This is illustrated by t_4 in the example: the first position of the head of rule 2
has an existentially quantified variable and thus x_{4,1} = 4.

(4) If a variable occurs at two different positions in the body of σ_{s_i} then the corresponding
positions in the tuples used to produce t_i carry the same value.

(5) If a variable in the body of σ_{s_i} also occurs in the head of σ_{s_i} then the values of the
corresponding positions in the body tuple and in t_i are equal.

- Z_2 occurs in position 3 of the second atom of the body of σ_4 and in position 2
of its head. Therefore, x_{4,3} and x_{5,2} have to coincide (where the 4 is determined
by c_{5,2}).

⁶ As the latter constraints are independent from the concrete tgds, we decided to put them here.
They could as well be tested in r_chase.

Note that all these conditions depend on the given tgds. Indeed, every tgd from Σ
contributes conditions of each of the five forms. For the sake of simplicity of presentation,
we explain the effect of a tgd through the following example tgd that contains
all relevant features that might arise in a tgd. The generalization to arbitrary tgds is
straightforward but tedious to spell out in full detail. Let us thus assume that σ_1 is the
tgd⁷

R_2(X, Y), R_3(Y, Z) → ∃V R_4(X, V).

Condition (1) states that if a tuple t_i is obtained by applying σ_1 it should be a tuple
from R_4. In terms of variables this means that for every i it should hold: if s_i = 1 then
r_i = 4.

This is the first occasion where we need some way to express a disjunction in r_goal
(namely: s_i ≠ 1 ∨ r_i = 4). We can meet this challenge with the help of an additional
predicate to be specified in P_chase. More precisely, we let P_chase specify a 4-ary predicate
IfThen(X_1, X_2, U_1, U_2) that is intended to contain all tuples fulfilling the condition: if
X_1 = X_2 then U_1 = U_2. IfThen can be specified by the following two rules.

IfThen(X, X, U, U) :- DNum(X), DNum(U).

IfThen(X_1, X_2, U_1, U_2) :-
    DNum(X_1), DNum(X_2), DNum(U_1), DNum(U_2), Neq(X_1, X_2).

Thus, condition (1) can be guaranteed with respect to tgd σ_1 for all tuples t_i by
adding all atoms of the form IfThen(s_i, 1, r_i, 4) to r_chase.
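
To see that the two rules indeed capture the implication, one can materialize IfThen over a small domain and compare it with the intended condition. The Python sketch below does this; the function name if_then and the four-element domain are assumptions made for illustration, with the domain playing the role of DNum.

```python
from itertools import product

def if_then(domain):
    """Materialize IfThen over domain as the union of its two defining rules."""
    # Rule 1: IfThen(X, X, U, U) :- DNum(X), DNum(U).
    rule1 = {(x, x, u, u) for x in domain for u in domain}
    # Rule 2: IfThen(X1, X2, U1, U2) :- DNum(...), Neq(X1, X2).
    rule2 = {(x1, x2, u1, u2)
             for x1, x2, u1, u2 in product(domain, repeat=4) if x1 != x2}
    return rule1 | rule2

domain = [1, 2, 3, 4]
rel = if_then(domain)

# The atom IfThen(s_i, 1, r_i, 4) holds when s_i = 1 and r_i = 4 ...
holds_for_r4 = (1, 1, 4, 4) in rel
# ... fails when sigma_1 is applied but the tuple is not an R4-tuple ...
fails_for_r2 = (1, 1, 2, 4) not in rel
# ... and imposes nothing when another tgd (here number 2) is applied.
free_for_other_tgd = (2, 1, 3, 4) in rel
```

The relation contains exactly the tuples with X_1 ≠ X_2 or U_1 = U_2, which is the disjunction needed in r_goal.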

Condition (2) is slightly more complicated. For our example tgd σ_1 it says that if a
tuple t_i is obtained using σ_1 then the first tuple used for the chase step should be an
R_2-tuple. In terms of variables this can be stated as: if s_i = 1 and c_{i1} = j then r_j = 2
(and likewise for the second atom of σ_1). To express this IF-statement we use a 6-ary
auxiliary predicate IfThen2(X_1, X_2, Y_1, Y_2, U_1, U_2) expressing that if X_1 = X_2 and
Y_1 = Y_2 then U_1 = U_2. It can be specified in P_chase by the following three rules.

IfThen2(X, X, Y, Y, U, U) :- DNum(X), DNum(Y), DNum(U).

IfThen2(X_1, X_2, Y_1, Y_2, U_1, U_2) :-
    DNum(X_1), DNum(X_2), DNum(Y_1), DNum(Y_2), DNum(U_1), DNum(U_2),
    Neq(X_1, X_2).

IfThen2(X_1, X_2, Y_1, Y_2, U_1, U_2) :-
    DNum(X_1), DNum(X_2), DNum(Y_1), DNum(Y_2), DNum(U_1), DNum(U_2),
    Neq(Y_1, Y_2).

For every pair of numbers i, j ≤ N, r_chase then has atoms IfThen2(s_i, 1, c_{i1}, j, r_j, 2)
and IfThen2(s_i, 1, c_{i2}, j, r_j, 3), the latter because the second body atom of σ_1 is an
R_3-atom.

⁷ This example tgd is not related to our running example as that does not have a single tgd with
all features.
In a similar fashion

- condition (3) yields one atom IfThen(s_i, 1, x_{i2}, i), for every i;

- condition (4) yields one atom IfThen3(s_i, 1, c_{i1}, j_1, c_{i2}, j_2, x_{j_1,2}, x_{j_2,1}), for every
i, j_1, j_2 ≤ N, where IfThen3 is the 8-ary predicate for IfThen-statements with
three conjuncts that can be defined analogously to IfThen2;

- condition (5) yields one atom IfThen2(s_i, 1, c_{i1}, j, x_{j1}, x_{i1}) for every i, j ≤ N.

Altogether, r_chase has O(N^3 ℓ k) atoms that together guarantee that the variables of
r_tuples encode an actual chase sequence.
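
The following Python sketch lists, as plain strings, the atoms that the example tgd R_2(X, Y), R_3(Y, Z) → ∃V R_4(X, V) contributes to r_chase for a given N. It is meant only to make the counting concrete; the function name r_chase_atoms and the textual variable names (s1, c1_1, x2_1, and so on) are assumptions of this illustration.

```python
from itertools import product

def r_chase_atoms(n):
    """List the r_chase atoms contributed by the example tgd (tgd number 1)."""
    idx = range(1, n + 1)
    atoms = []
    for i in idx:
        atoms.append(f"IfThen(s{i},1,r{i},4)")                    # condition (1): head is R4
        atoms.append(f"IfThen(s{i},1,x{i}_2,{i})")                # condition (3): V is the null i
    for i, j in product(idx, idx):
        atoms.append(f"IfThen2(s{i},1,c{i}_1,{j},r{j},2)")        # condition (2): first body atom is R2
        atoms.append(f"IfThen2(s{i},1,c{i}_2,{j},r{j},3)")        # condition (2): second body atom is R3
        atoms.append(f"IfThen2(s{i},1,c{i}_1,{j},x{j}_1,x{i}_1)") # condition (5): X is passed to the head
    for i, j1, j2 in product(idx, idx, idx):
        # condition (4): Y is shared by position 2 of atom 1 and position 1 of atom 2
        atoms.append(f"IfThen3(s{i},1,c{i}_1,{j1},c{i}_2,{j2},x{j1}_2,x{j2}_1)")
    return atoms

atoms = r_chase_atoms(3)
```

For this single tgd the number of atoms is N^3 + 3N^2 + 2N, so the cubic term stems from condition (4).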

P_query and r_query. Finally, we explain how it can be checked that there is a
homomorphism from q to S. We explain the issue through the little example query
R_3(x, y) ∧ R_4(y, z). To evaluate this query, r_query makes use of two additional
variables q_1 and q_2, one for each atom of q. The intention is that these variables bind to the
numbers of the tuples that the atoms are mapped to. We have to ensure two kinds
of conditions. First, the tuples need to have the right relation symbol and second, they
have to obey value equalities induced by the variables of q that occur more than once.

The first kind of conditions is checked by adding atoms IfThen(q_1, i, r_i, 3) and
IfThen(q_2, i, r_i, 4) to r_query, for every i ≤ N. The second kind of conditions can be
checked by atoms IfThen2(q_1, i, q_2, j, x_{i2}, x_{j1}), for every i, j ≤ N.

As we do not need any further auxiliary predicates, P_query is empty (but we kept it
for symmetry reasons).

This completes the description of P. Note that P is nonrecursive, and has polynomial
size in the size of q and Σ. In order to finish the proof of part (a) of Theorem 1,
we next explain how to reduce the arity of P.

This final step of the construction is based on two ideas. 

First, by using Boolean variables and some new ternary relations, we can replace
the 6-ary relation IfThen2 (and likewise the 4-ary relation IfThen). More precisely, we
replace every atom IfThen2(X_1, X_2, Y_1, Y_2, U_1, U_2) by a conjunction of the form

IfEq(X_1, X_2, B_1), IfEq(Y_1, Y_2, B_2), IfEq(U_1, U_2, B_3), NotB(B_1, B'_1),
NotB(B_2, B'_2), OrB(B'_1, B'_2, B_4), OrB(B_3, B_4, B_5), TrueB(B_5).

Here, NotB and OrB are predicates that mimic Boolean gates, e.g., OrB(B_3, B_4, B_5)
holds if B_5 is the Boolean Or of B_3 and B_4; in particular, all values have to be from
{0, 1}. TrueB(B_5) only holds if B_5 = 1. The predicate IfEq(X_1, X_2, B_1) holds if B_1 =
1 and X_1 = X_2 or if B_1 = 0 and X_1 ≠ X_2. The relations IfEq, NotB, OrB, TrueB can
easily be defined in P_chase.
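
The equivalence between the 6-ary atom and the conjunction of gate atoms can be checked exhaustively over a small domain. The Python sketch below does so, treating the Boolean variables as existentially quantified; the function names and the three-element test domain are assumptions made for this illustration.

```python
from itertools import product

def if_eq(x1, x2, b):
    """IfEq holds if b = 1 and x1 = x2, or b = 0 and x1 != x2."""
    return b == (1 if x1 == x2 else 0)

def not_b(b, nb):
    return b in (0, 1) and nb == 1 - b

def or_b(b1, b2, b3):
    return b1 in (0, 1) and b2 in (0, 1) and b3 == (b1 | b2)

def true_b(b):
    return b == 1

def if_then2_via_gates(x1, x2, y1, y2, u1, u2):
    """True iff some assignment of the Boolean variables satisfies the conjunction."""
    return any(
        if_eq(x1, x2, b1) and if_eq(y1, y2, b2) and if_eq(u1, u2, b3)
        and not_b(b1, nb1) and not_b(b2, nb2)
        and or_b(nb1, nb2, b4) and or_b(b3, b4, b5) and true_b(b5)
        for b1, b2, b3, nb1, nb2, b4, b5 in product((0, 1), repeat=7)
    )

def if_then2(x1, x2, y1, y2, u1, u2):
    """Intended meaning: if x1 = x2 and y1 = y2 then u1 = u2."""
    return not (x1 == x2 and y1 == y2) or u1 == u2

agree = all(if_then2_via_gates(*v) == if_then2(*v)
            for v in product(range(3), repeat=6))
```

Each gate atom has arity at most three, which is why the arity bound max(a + 1, 3) contains the constant 3.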

The second idea is that T need not be materialized. We only materialize a relation
T' of arity a + 1 which is intended to represent all database tuples. More precisely,
T'(j, X_1, ..., X_a) shall hold if (X_1, ..., X_a) represents a tuple from relation R_j or if
j = 0. Clearly, T' can be defined in P_tuples.

Every atom T(j, r_j, f_j, x_{j1}, ..., x_{ja}, s_j, c_{j1}, ..., c_{jk}) in r_tuples is then replaced by a
conjunction of atoms with the same semantics. The conjunct T'(r'_j, x_{j1}, ..., x_{ja}) tests
whether (x_{j1}, ..., x_{ja}) is in R_{r'_j}. Further atoms ensure that r_j = r'_j if f_j = 0. Finally,
it is ensured that, if f_j = 1, the values are restricted as by the right-hand side of rule (1).

In order to prove part (b), we must get rid of the numeric domain (except for 0
and 1). This is actually very easy. We just replace each numeric value by a logarithmic
number of bits (coded by our 0 and 1 domain elements), and extend the predicate arities
accordingly. As a matter of fact, this requires an increase of arity by a factor of log N,
which is logarithmic in the joint size of Σ and q because N is polynomially bounded in
|Σ| and |q|. It is well known that a successor predicate and a vectorized < predicate
for such bit-vectors can be expressed by a polynomially-sized nonrecursive Datalog
program, see [10]. The rest is completely analogous to the above proof. This concludes
the proof sketch for Theorem 1.
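
The bit-vector encoding can be illustrated as follows. The Python sketch encodes numbers as tuples over {0, 1} and defines the order and successor tests directly on these tuples; it is a plain procedural illustration, not the Datalog program referred to in [10], and the names to_bits, less_than and successor are assumptions.

```python
def to_bits(value, width):
    """Encode value as a tuple of width bits, most significant bit first."""
    return tuple((value >> (width - 1 - p)) & 1 for p in range(width))

def less_than(a, b):
    """Vectorized <: at the first differing position, a has 0 and b has 1."""
    for x, y in zip(a, b):
        if x != y:
            return x < y
    return False

def successor(a, b):
    """b = a + 1: the lowest 0 of a becomes 1 and all bits to its right become 0."""
    for p in range(len(a) - 1, -1, -1):
        if a[p] == 0:
            return b == a[:p] + (1,) + (0,) * (len(a) - 1 - p)
    return False  # a consists of ones only, so it has no successor at this width

n = 10
width = n.bit_length()          # 4 bits suffice for the numbers 0..10
codes = [to_bits(i, width) for i in range(n + 1)]
```

Each numeric argument position is thus replaced by width = O(log N) Boolean positions, which matches the arity increase stated for part (b).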

We would like to conclude this section with some remarks: 

Remark 1. Note that the evaluation complexity of the Datalog program obtained 
for case (b) is not significantly higher than the evaluation complexity of the program P 
constructed for case (a). For example, in the most relevant case of bounded arities, both 
programs can be evaluated in NPTIME combined complexity over a database D. In 
fact, it is well-known that the combined complexity of a Datalog program of bounded 
arity is in NPTIME (see [10]). But it is easy to see that if we expand the signature of 
such a program (and of the underlying database) by a logarithmic number of Boolean- 
valued argument positions (attributes), nothing changes, because the possible values for 
such vectorized arguments are still of polynomial size. It is just a matter of coding. In a 
similar way, the data complexity in both cases (a) and (b) is the same (PTIME). 

Remark 2. It is easy to generalize this result to the setting where q is actually a 
union of conjunctive queries (UCQ). 

Remark 3. The method easily generalizes to translate non-Boolean queries, i.e., 
queries with output, to polynomially-sized nonrecursive Datalog programs with out- 
put. We are here only interested in certain answers consisting of tuples of values 
from the original domain dom(D) (see [12]). Assume that the head of q is an atom
R(X_1, ..., X_k) where R is the output relation symbol, and the X_i are variables also
occurring in the body of q. We then obtain a nonrecursive Datalog translation by acting
as in the above proof, except for the following modifications. Make R(X_1, ..., X_k) the
head of rule r_goal, and add for 1 ≤ i ≤ k an atom adom(X_i) to r_query, where adom is
an auxiliary predicate such that adom(u) is true iff u is in the active non-numeric domain
of the database, that is, iff u ∈ dom(D) and u effectively occurs in the database. It is
easy to see that the auxiliary predicate adom itself can be defined via a nonrecursive
Datalog program from D. Clearly, by construction of (the so modified) program P, the
output of P then consists precisely of the certain answers of the query q.

Remark 4. The polynomially-sized nonrecursive Datalog program P constructed 
in the proof of Theorem 1 can in turn be transformed in polynomial time into an equivalent
first-order formula. In case of N-numerical databases this follows immediately
from the constant depth of (the predicate dependency graph of) P. Moreover, in case of
non-numerical domains with two distinguished constants, the simulation of a numerical
domain via bit-vectors can be easily expressed by a polynomially sized first-order
formula. In summary, Theorem 1 remains valid if we replace "nonrecursive Datalog
program" by "first-order formula". However, for practical purposes nonrecursive Datalog
may be the better choice, because the auxiliary relations that need to be computed
only once are already factored out explicitly.

4 Further Results Derived From the Main Theorem 

We wish to mention some interesting consequences of Theorem 1 that follow easily 
from the above result after combining it with various other known results. 

4.1 Linear TGDs 

A linear tgd [5] is one that has a single atom in its rule body. The class of linear tgds
is a fundamental one in the Datalog± family. This class contains the class of inclusion
dependencies. It was already shown in [14] that classes of inclusion dependencies of
bounded (predicate) arity enjoy the Polynomial Witness Property (PWP). That proof carries
over to linear tgds, and we thus can state:

Lemma 1. Classes of linear tgds of bounded arity enjoy the PWP. 

By Theorem 1, we then conclude: 

Theorem 2. Conjunctive queries under linear tgds of bounded arity are polynomially 
rewritable as nonrecursive Datalog programs in the same fashion as for Theorem 1. So 
are sets of inclusion dependencies of bounded arity. 
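
To make the syntactic conditions of this subsection concrete, here is a minimal
sketch of a tgd representation with checks for linearity and for the usual
characterization of inclusion dependencies (one body atom, one head atom, no
variable repeated inside an atom). The class, its methods, and the example rules
are illustrative assumptions, not notation from the paper:

```python
from dataclasses import dataclass
from typing import Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate name, argument variables)


@dataclass(frozen=True)
class TGD:
    body: Tuple[Atom, ...]
    head: Tuple[Atom, ...]  # head variables missing from the body are existential

    def is_linear(self) -> bool:
        """A tgd is linear if its body consists of a single atom."""
        return len(self.body) == 1

    def max_arity(self) -> int:
        """Largest predicate arity used anywhere in the rule."""
        return max(len(args) for _, args in self.body + self.head)

    def is_inclusion_dependency(self) -> bool:
        """Single body atom, single head atom, no repeated variable in an atom."""
        if len(self.body) != 1 or len(self.head) != 1:
            return False
        return all(len(set(args)) == len(args) for _, args in self.body + self.head)


# employee(X) -> exists Y manages(Y, X): linear, and an inclusion dependency.
has_manager = TGD(body=(("employee", ("X",)),),
                  head=(("manages", ("Y", "X")),))
# r(X, Y), s(Y, Z) -> t(X, Z): a join in the body, hence not linear.
join_rule = TGD(body=(("r", ("X", "Y")), ("s", ("Y", "Z"))),
                head=(("t", ("X", "Z")),))
# r(X, X) -> p(X): linear, but not an inclusion dependency (repeated variable).
diagonal = TGD(body=(("r", ("X", "X")),), head=(("p", ("X",)),))

print(has_manager.is_linear(), join_rule.is_linear(), diagonal.is_inclusion_dependency())
```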

4.2 DL-Lite 

A pioneering and highly significant contribution towards tractable ontological reasoning 
was the introduction of the DL-Lite family of description logics (DLs) by Calvanese et 
al. [9, 20]. DL-Lite was further studied and developed in [1]. 

A DL-Lite theory (or TBox) $\Sigma = (\Sigma^-, \Sigma^+)$ consists of a set $\Sigma^-$ of negative constraints,
such as key and disjointness constraints, and of a set $\Sigma^+$ of positive constraints
that resemble tgds. As shown in [9], the negative constraints $\Sigma^-$ can be compiled into a
polynomially sized first-order formula (actually a union of conjunctive queries) of the
same arity as $\Sigma^-$ such that for each database D and BCQ q, $(D, \Sigma) \models q$ iff $D \models \Sigma^-$
and $(D, \Sigma^+) \models q$. In (the full version of) [5] it was shown that for the main DL-Lite
variants defined in [9], each $\Sigma^+$ can be immediately translated into an equivalent set of
linear tgds of arity 2. By virtue of this and of the above, we obtain the following theorem.



Theorem 3. Let q be a CQ and let $\Sigma = (\Sigma^-, \Sigma^+)$ be a DL-Lite theory expressed
in one of the following DL-Lite variants: DL-Lite$_{\mathcal{F},\sqcap}$, DL-Lite$_{\mathcal{R},\sqcap}$, DL-Lite$^{+}_{\mathcal{A},\sqcap}$,
DLR-Lite$_{\mathcal{F},\sqcap}$, DLR-Lite$_{\mathcal{R},\sqcap}$, or DLR-Lite$^{+}_{\mathcal{A},\sqcap}$. Then $\Sigma^+$ can be rewritten into a nonrecursive
Datalog program P such that for each database D, $(D, \Sigma^+) \models q$ iff $D \models P$. Regarding
the arities of P, the same bounds as in Theorem 1 hold.
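
The flavour of the translation of positive constraints into linear tgds of arity
at most 2 can be sketched as follows. This is only an illustration for basic
concept and role inclusions, written in a hypothetical tuple syntax; it is not
the translation of [5] and does not cover all constructs of the variants listed
in Theorem 3:

```python
from typing import Tuple

# (kind, name); kinds: concept, exists, exists_inv, role, inv_role
Expr = Tuple[str, str]


def atom(expr: Expr, x: str, y: str) -> str:
    """Atom for an expression, with x as the variable the expression talks about."""
    kind, name = expr
    if kind == "concept":
        return f"{name}({x})"
    if kind in ("exists", "role"):
        return f"{name}({x}, {y})"
    if kind in ("exists_inv", "inv_role"):
        return f"{name}({y}, {x})"
    raise ValueError(f"unknown expression kind: {kind}")


def to_linear_tgd(lhs: Expr, rhs: Expr) -> str:
    """Translate the inclusion lhs ⊑ rhs into a tgd with a single body atom."""
    if lhs[0] in ("role", "inv_role"):
        # Role inclusion: both atoms share X and Y, no existential variable.
        return f"{atom(lhs, 'X', 'Y')} -> {atom(rhs, 'X', 'Y')}"
    body = atom(lhs, "X", "Y")
    head = atom(rhs, "X", "Z")
    prefix = "exists Z " if rhs[0] in ("exists", "exists_inv") else ""
    return f"{body} -> {prefix}{head}"


# Hypothetical axioms: professor ⊑ ∃teaches, ∃teaches⁻ ⊑ course, teaches ⊑ taught_by⁻
print(to_linear_tgd(("concept", "professor"), ("exists", "teaches")))
print(to_linear_tgd(("exists_inv", "teaches"), ("concept", "course")))
print(to_linear_tgd(("role", "teaches"), ("inv_role", "taught_by")))
```

Every rule produced this way has exactly one body atom and uses only unary and
binary predicates.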

4.3 Sticky and Sticky Join TGDs 

Sticky tgds [6] and sticky-join tgds [6] are special classes of tgds that generalize linear 
tgds but allow for a limited form of join (including as a special case the Cartesian product).
They allow one to express natural ontological relationships not expressible in DLs such 
as OWL. We do not define these classes here, and refer the reader to [8]. By results 
of [8], which will also be discussed in more detail in a future extended version of the 
present paper, both classes enjoy the Polynomial Witness Property. By Theorem 1, we 
thus obtain the following result: 

Theorem 4. Conjunctive queries under sticky tgds and sticky-join tgds over a fixed
signature $\mathcal{R}$ are rewritable into polynomially sized nonrecursive Datalog programs of
arity bounded as in Theorem 1.

5 Related Work on Query Rewriting 

Several techniques for query rewriting have been developed. An early algorithm,
introduced in [9] and implemented in the QuOnto system*, reformulates the given query
into a union of CQs (UCQ) by means of a backward-chaining resolution procedure.
The size of the computed rewriting increases exponentially w.r.t. the number of atoms
in the given query. This is mainly due to the fact that unifications are derived in a
"blind" way from every unifiable pair of atoms, even if the generated rule is superfluous.
An alternative resolution-based rewriting technique was proposed by Pérez-Urbina
et al. [19] and implemented in the Requiem system†. It produces a UCQ as a rewriting
which is, in general, smaller than the one computed by QuOnto (but still exponential in
the number of atoms of the query). This is achieved by avoiding many useless
unifications, and thus the generation of redundant rules due to such unifications.
This algorithm also works for more expressive non-first-order rewritable DLs. In this
case, the computed rewriting is a (recursive) Datalog query. Following a more general
approach, Calì et al. [3] proposed a backward-chaining rewriting algorithm for the
first-order rewritable Datalog± languages mentioned above. However, this algorithm is
inspired by the original QuOnto algorithm and inherits all its drawbacks. In [13], a
rewriting technique for linear Datalog± into unions of conjunctive queries is proposed.
This algorithm is an improved version of the one already presented in [3], where further
superfluous unifications are avoided, and where, in addition, redundant atoms in the
body of a rule that are logically implied (w.r.t. the ontological theory) by other atoms
in the same rule are eliminated. This elimination of body atoms avoids the
construction of redundant rules during the rewriting process. However, the size
of the rewriting is still exponential in the number of query atoms.

* http://www.dis.uniroma1.it/~quonto/

† http://www.comlab.ox.ac.uk/projects/requiem/home.html

Of more interest to the present work are rewritings into nonrecursive Datalog.
In [15, 16] a polynomial-size rewriting into nonrecursive Datalog is given for the
description logics DL-Lite$^{\mathcal{F}}_{horn}$ and DL-Lite$_{horn}$. For DL-Lite$^{\mathcal{N}}_{horn}$, a DL with counting, a
polynomial rewriting involving aggregate functions is proposed. It is, moreover, shown
in (the full version of) [15] that for the description logic DL-Lite$_{\mathcal{F}}$ a polynomial-size
pure first-order query rewriting is possible. Note that none of these logics allows for
role inclusion, while our approach covers description logics with role inclusion axioms.
Other results in [15, 16] are about combined rewritings where both the query and the
database D have to be rewritten. A recent very interesting paper discussing polynomial
size rewritings is [22]. Among other results, [22] provides complexity-theoretic
arguments indicating that without the use of special constants (e.g., 0 and 1, or the
numerical domain), a polynomial rewriting such as ours may not be possible. Rosati et al. [21]
recently proposed a very sophisticated rewriting technique into nonrecursive Datalog,
implemented in the Presto system. This algorithm produces a nonrecursive Datalog
program as a rewriting instead of a UCQ. This allows the "hiding" of the exponential
blow-up inside the rules instead of generating explicitly the disjunctive normal form.
The size of the final rewriting is, however, exponential in the number of non-eliminable
existential join variables of the given query; such variables are a subset of the join
variables of the query, and are typically fewer than the number of atoms in the query. Thus,
the size of the rewriting is exponential in the query size in the worst case. Relevant
further optimizations of this method are given in [18].
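
The difference between spelling out a disjunctive normal form and "hiding" the
blow-up inside rules can be seen on a deliberately simple, hypothetical example
(it is not taken from any of the cited systems). Assume a query with atoms
a1(X), ..., an(X) and an ontology stating that each bi is included in ai. A UCQ
rewriting must choose between ai and bi independently for every atom, whereas a
nonrecursive Datalog rewriting introduces one auxiliary predicate per atom:

```python
from itertools import product
from typing import List


def ucq_rewriting(n: int) -> List[str]:
    """All disjuncts obtained by choosing a_i or b_i independently for each i."""
    choices = [(f"a{i}(X)", f"b{i}(X)") for i in range(1, n + 1)]
    return ["q :- " + ", ".join(combo) + "." for combo in product(*choices)]


def datalog_rewriting(n: int) -> List[str]:
    """One goal rule plus two rules per query atom."""
    goal = "q :- " + ", ".join(f"a{i}_star(X)" for i in range(1, n + 1)) + "."
    rules = [goal]
    for i in range(1, n + 1):
        rules.append(f"a{i}_star(X) :- a{i}(X).")
        rules.append(f"a{i}_star(X) :- b{i}(X).")
    return rules


for n in (2, 5, 10):
    # The UCQ has 2**n disjuncts, the Datalog program has 2*n + 1 rules.
    print(n, len(ucq_rewriting(n)), len(datalog_rewriting(n)))
```

This toy case involves no existential join variables, so it only illustrates the
size gap between the two output formats, not the harder cases addressed by the
rewritings discussed above.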